This report explores a dataset containing 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
The aim of this report is to find which chemical properties influence the quality of white wine.
## [1] 4898 13
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median : 5.200 Median :0.04300 Median : 34.00 Median :134.0
## Mean : 6.391 Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :65.800 Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
Our dataset contains 13 variables: 11 chemical properties, the wine rating (quality), and the wine code (X).
The summary table shows how the values of each variable are distributed: the mean, the median, the minimum and maximum, and the 1st and 3rd quartiles.
The dataset appears to have no missing values: the summary table reports no NA entries, and the minimum of every variable is above 0 except for citric acid, whose minimum is 0. We will check later whether one or several wines have a citric acid value of 0, to determine whether it is a missing value or a genuine 0.
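A minimal sketch of this check, using only the first ten rows of two columns as printed by `str()` above. The name `wines_head` is hypothetical; the full data frame would be checked the same way.

```r
# First ten rows of two columns, copied from the str() output above.
wines_head <- data.frame(
  citric.acid    = c(0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16, 0.36, 0.34, 0.43),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5)
)

# Missing values are stored as NA in R, so count them directly
# instead of inferring their absence from the minimum.
na_per_column <- colSums(is.na(wines_head))

# Zeros are counted separately: a 0 may be a genuine measurement.
zeros_per_column <- colSums(wines_head == 0)

print(na_per_column)
print(zeros_per_column)
```

Counting NA and 0 separately keeps the two questions apart: whether a value is absent, and whether a recorded 0 is plausible for that property.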
## Wine per quality
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Most wines are rated 5 or 6 (3,655 of 4,898, about 75%). To analyze how the values are distributed, we plot them with varying scales and bin widths, and zoom in on specific ranges.
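The share of each quality score can be derived from the counts in the "Wine per quality" table. This is a small sketch; the variable names are illustrative and not taken from the report's code.

```r
# Counts per quality score, copied from the "Wine per quality" table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Share of each score, as a percentage of all wines.
quality_share <- round(100 * quality_counts / sum(quality_counts), 1)
print(quality_share)

# Combined share of the two most common scores (5 and 6).
share_5_6 <- sum(quality_counts[c("5", "6")]) / sum(quality_counts)
print(round(share_5_6, 3))
```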
Let's look at the acidity properties.
## Count for citric acid (g / dm^3) low values
## 0 0.01 0.02 0.03 0.04 0.05
## 19 7 6 2 12 5
Fixed acidity of most wines lies between 6.3 and 7.3 and is approximately normally distributed. A few wines have very low fixed acidity (left tail) and some have very high values (right tail).
Volatile acidity of most wines lies between 0.21 and 0.32 (the 1st and 3rd quartiles in the summary table) and is approximately normally distributed. Some wines have higher values, up to the maximum of 1.1 (right tail).
Citric acid of most wines lies between 0.27 and 0.39, but about 50 wines have values of 0.05 or lower, and 19 have a value of exactly 0, which we treat as valid data rather than missing values.
Residual sugar is mainly between 0 and 10, but there are some values over 10 and a maximum of 65.8. The distribution is skewed to the right; on a log10 scale it looks like two overlapping normal curves.
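To illustrate what the log10 scale does to a right-skewed variable, here is a sketch in base R using only the first ten residual sugar values shown in the `str()` output. Ten values cannot reproduce the two-peak shape of the full data; the bin edges are an arbitrary choice for this illustration.

```r
# First ten residual sugar values (g / dm^3), copied from the str() output.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5)

# On the raw scale, the largest values stretch the axis.
raw_range <- range(residual_sugar)

# log10 compresses the right tail: one unit is a tenfold change.
log_sugar <- log10(residual_sugar)
log_range <- range(log_sugar)

# Bin the transformed values without drawing (plot = FALSE).
log_hist <- hist(log_sugar, breaks = seq(0, 1.5, by = 0.25), plot = FALSE)
print(log_hist$counts)
```

With ggplot2, the same idea is usually expressed by adding `scale_x_log10()` to the histogram instead of transforming the column by hand.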
Chlorides, free sulfur dioxide and total sulfur dioxide are approximately normally distributed, with some high values in the right tail.
# Histograms - Density
library(ggplot2)    # qplot(), scale_x_continuous()
library(gridExtra)  # grid.arrange()
# Note: qplot() is deprecated as of ggplot2 3.4.0; ggplot() + geom_histogram() is the current equivalent.

# d1: full range of density, bin width 0.001
d1 <- qplot(x = density, data = WWQ, binwidth = 0.001, color = I("white"),
            xlab = 'density (g / cm^3)',
            ylab = 'Count') +
  scale_x_continuous(limits = c(0.9870, 1.0391), breaks = seq(0.98, 1.040, 0.01))

# d2: zoom on the bulk of the data with very fine bins (0.00001)
d2 <- qplot(x = density, data = WWQ, binwidth = 0.00001, color = I("blue"),
            xlab = 'density (g / cm^3)',
            ylab = 'Count') +
  scale_x_continuous(limits = c(0.9915, 0.9965), breaks = seq(0.991, 0.9965, 0.001))

# d3: full range of density, 30 bins
d3 <- qplot(x = density, data = WWQ, bins = 30, color = I("white"),
            xlab = 'density (g / cm^3)',
            ylab = 'Count') +
  scale_x_continuous(limits = c(0.9870, 1.0391), breaks = seq(0.98, 1.040, 0.01))

summary(WWQ$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390

# d4: limits set to the 1st and 3rd quartiles, 30 bins
d4 <- qplot(x = density, data = WWQ, bins = 30, color = I("white"),
            xlab = 'density (g / cm^3)',
            ylab = 'Count') +
  scale_x_continuous(limits = c(0.9917, 0.9961), breaks = seq(0.991, 0.9961, 0.001))

# Axis limits drop the rows outside the range, which produces the warnings below.
grid.arrange(d1, d2, d3, d4, ncol = 2)
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2170 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2406 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Density ranges from 0.9871 to 1.0390, with an interquartile range of ~0.0044 (0.9917 to 0.9961). We set the axis limits to the interquartile range and vary the scale, the bin width and the breaks to get a clear view of the density distribution. In detail, some density values are more frequent than others, even though the overall shape is a normal distribution.
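The interquartile range can be computed from the quartiles printed by `summary(WWQ$density)`. The sketch below also keeps only the values between the two quartiles, which is what the limits of the fourth histogram do. It uses the five density values visible in the `str()` output; the variable names are illustrative.

```r
# Quartiles of density, copied from the summary() output above.
q1 <- 0.9917
q3 <- 0.9961
iqr <- q3 - q1

# The five density values visible in the str() output (the rest is elided there).
density_head <- c(1.001, 0.994, 0.995, 0.996, 0.996)

# Keep only values inside [Q1, Q3], mirroring limits = c(0.9917, 0.9961).
inside_iqr <- density_head[density_head >= q1 & density_head <= q3]

print(iqr)
print(inside_iqr)
```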
pH and sulphates are approximately normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol is approximately normally distributed, with a long right tail. 75% of the wines have an alcohol content of 11.4% or less. The wine with the highest alcohol content has 14.2%.
The boxplots show the distribution of values for the different properties.
Some properties have outliers. They can be removed one property at a time or after a multivariate analysis.
Removing outliers for some of the properties changes how the property is interpreted, but it is more informative when done as part of a multivariate analysis. It’s important to note that we may not always be interested in the bulk of the data. Sometimes, the outliers are of interest, and it’s important that we understand their values and why they appear in the data set.
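The report does not state which rule was used to identify outliers. One common convention, assumed here purely for illustration, is the boxplot rule: flag values more than 1.5 interquartile ranges below the 1st quartile or above the 3rd. The function name `flag_outliers` and the toy values are hypothetical.

```r
# Flag values outside the fences Q1 - k * IQR and Q3 + k * IQR.
# The multiplier k = 1.5 is a common convention, not taken from the report.
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence_low  <- q[1] - k * (q[2] - q[1])
  fence_high <- q[2] + k * (q[2] - q[1])
  x < fence_low | x > fence_high
}

# Toy values for illustration only (not from the wine data).
toy <- c(1, 2, 3, 4, 100)
is_outlier <- flag_outliers(toy)

print(toy[is_outlier])   # values flagged as outliers
print(toy[!is_outlier])  # values kept
```

Returning a logical vector rather than a filtered one keeps the outliers available for inspection, which matters when they are themselves of interest.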
The data set contains 4898 observations with 13 variables: 11 chemical properties, the wine rating (quality) and the wine code (X).
The chemical properties are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol.
At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
The summary table shows how the values of each variable are distributed: the mean, the median, the minimum and maximum, and the 1st and 3rd quartiles.
This dataset has no missing values: the summary table reports no NA entries, and the minimum of every variable is above 0 except for citric acid. Since citric acid ranges from 0 to 1.66, a minimum of 0 looks like a genuine value and not a mistake.
The main feature is the quality of the wine, which results from the chemical properties; it is the mix of those properties that creates a synergy in the taste of the wine.
Most wines are rated 5 or 6 (3,655 of 4,898, about 75%). To analyze how the values are distributed, we plot them with varying scales and bin widths, and zoom in on specific ranges.
Fixed acidity of most wines lies between 6.3 and 7.3 and is approximately normally distributed. A few wines have very low fixed acidity (left tail) and some have very high values (right tail).
Volatile acidity of most wines lies between 0.21 and 0.32 (the 1st and 3rd quartiles in the summary table) and is approximately normally distributed. Some wines have higher values, up to the maximum of 1.1 (right tail).
Citric acid of most wines lies between 0.27 and 0.39, but more than 200 wines have values around 0.5, and some have even higher values (right tail).
Residual sugar is mainly between 0 and 10, but there are some values over 10 and a maximum of 65.8. We plot the values from 0 to 10 to see the residual sugar distribution in detail. The distribution is skewed to the right.
Chlorides, free sulfur dioxide and total sulfur dioxide are approximately normally distributed, with some high values in the right tail.
Density ranges from 0.9871 to 1.0390, with an interquartile range of ~0.0044 (0.9917 to 0.9961). We set the axis limits to the interquartile range and vary the scale, the bin width and the breaks to get a clear view of the density distribution. In detail, some density values are more frequent than others, even though the overall shape is a normal distribution.
pH and sulphates are approximately normally distributed.
Alcohol is approximately normally distributed, with a long right tail. 75% of the wines have an alcohol content of 11.4% or less. The wine with the highest alcohol content has 14.2%.
The most interesting chemical features are the ones with a wider range of values, e.g. residual sugar and alcohol.
Residual sugar is mainly between 0 and 10, but there are some values over 10 and a maximum of 65.8. We plot the values from 0 to 10 to see the residual sugar distribution in detail. The distribution is skewed to the right.
Density ranges from 0.9871 to 1.0390, with an interquartile range of ~0.0044 (0.9917 to 0.9961). We set the axis limits to the interquartile range and vary the scale, the bin width and the breaks to get a clear view of the density distribution. In detail, some density values are more frequent than others, even though the overall shape is a normal distribution.
Some properties have outliers. They can be removed one property at a time or after a multivariate analysis.
Removing outliers for some of the properties changes how the property is interpreted, but it is more informative when done as part of a multivariate analysis. It’s important to note that we may not always be interested in the bulk of the data. Sometimes, the outliers are of interest, and it’s important that we understand their values and why they appear in the data set.
The analysis of two variables starts with a plot of the correlations among all the variables.
The quality of the wine is positively correlated with alcohol and negatively correlated with density. Residual sugar is positively correlated with density and negatively correlated with alcohol.
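A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below uses only the first ten rows of three columns from the `str()` output, so its coefficients will not match the report's values for the full data; `wines_head` is a hypothetical name.

```r
# First ten rows of three columns, copied from the str() output above.
wines_head <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wines_head, method = "pearson")
print(round(cor_matrix, 2))

# A single pair can be read off by name.
sugar_alcohol <- cor_matrix["residual.sugar", "alcohol"]
print(round(sugar_alcohol, 2))
```

Pearson correlation measures linear association only, so a low coefficient does not rule out a non-linear relationship.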
Distribution of the properties across the quality levels, for the properties with a high correlation (+ or -) with quality.
The alcohol values are not homogeneous across the quality levels; the same is true of density (without outliers) and of the other properties that have a + or - correlation with quality.
Let's look at this in more detail, with the mean and the quantiles for each property, and with outliers removed for density, chlorides and volatile acidity.
Alcohol is higher for the better qualities, and density is lower for the better qualities.
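Per-quality summaries like these can be computed with `tapply()`. The data frame below is a toy example with made-up values, included only to show the mechanics; it is not taken from the wine data set, and `toy_wines` is a hypothetical name.

```r
# Toy data for illustration only; these values are NOT from the wine data set.
toy_wines <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.0, 10.0, 10.0, 11.0, 11.0, 12.0)
)

# Mean and median alcohol within each quality level.
mean_alcohol   <- tapply(toy_wines$alcohol, toy_wines$quality, mean)
median_alcohol <- tapply(toy_wines$alcohol, toy_wines$quality, median)

print(mean_alcohol)
print(median_alcohol)
```

On the real data, the same two calls would take the quality and alcohol columns of the full data frame.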
Let's focus on quality labels 5, 6 and 7, as they are the most common.
Let's look at the relation of quality with alcohol and density in another visualization.
This visualization makes it clearer: a good wine combines a high level of alcohol with a low density.
Regarding relations between features, we can observe how density relates to total sulfur dioxide, residual sugar, fixed acidity and chlorides.
The lower the density, the better the quality. Low total sulfur dioxide, low fixed acidity, low chlorides and low residual sugar are linked to a lower density and therefore to a better wine.
The strongest relations are between density and alcohol (negative correlation of 78%) and between density and residual sugar (positive correlation of 84%).
The quality of the wine is positively correlated with alcohol and negatively correlated with density. Residual sugar is positively correlated with density and negatively correlated with alcohol.
Lower density means better quality. Low total sulfur dioxide, low fixed acidity, low chlorides and low residual sugar are linked to a lower density and therefore to a better wine. This point is to be investigated in the multivariate analysis.
The strongest relations are between density and alcohol (negative correlation of 78%) and between density and residual sugar (positive correlation of 84%).
Density has a strong positive correlation with residual sugar and total sulfur dioxide, and a moderate one with chlorides and fixed acidity.
Alcohol has a strong negative correlation with density and a moderate negative correlation with total sulfur dioxide, residual sugar and chlorides.
Residual sugar has a moderate positive correlation with total sulfur dioxide, a negative correlation with alcohol and a strong positive correlation with density.
Chlorides have a moderate positive correlation with density and a negative correlation with alcohol.
Fixed acidity, volatile acidity and citric acid have a low correlation with the other variables, except for pH.
Free sulfur dioxide has a low correlation with the other variables, except for total sulfur dioxide.
Total sulfur dioxide has a moderate positive correlation with density and a negative correlation with alcohol.
pH has no notable correlation with any variable, except a negative one with fixed acidity.
Sulphates have no notable correlation with any variable.
It will be interesting to investigate the data with multivariate plots.
The multivariate analysis of white wine examines the distribution of wines per quality in relation to:
- alcohol and density,
- density and residual sugar,
- alcohol and chlorides,
- alcohol and total sulfur dioxide.
These four charts show the same information, but none of them is really clear. They are good examples of how important the color, shape and size of chart elements are.
There is a clear shift from left to right and from top to bottom as quality increases: better wines have lower density and more alcohol, but the boundary between wines of quality 5, 6 and 7 is not sharp.
The chart is not very clear visually, but it shows that low residual sugar and low density are good indications of a good wine.
There is a clear shift from right to left and from top to bottom as quality increases: better wines have lower density and less sugar, but the boundary between wines of quality 5, 6 and 7 is not sharp.
Lower density means better quality. Low total sulfur dioxide, low fixed acidity, low chlorides and low residual sugar are linked to a lower density and therefore to a better wine. This holds for all white wines. The analysis of the characteristics of the white wines per quality does not reveal clear differences between them. Alcohol and residual sugar seem to interact: as we move from a bad wine to a better wine, the points shift towards less sugar and more alcohol.
This basic plot shows the overall structure of the data: at a glance, the distribution of wine quality is approximately normal.
The next step is to find the characteristics that put a wine in category 5 or 6, and the combinations of characteristics that determine whether a wine is rated 5 or 9.
The two chemical characteristics most correlated with the quality of the wine are alcohol (+44%) and density (-31%). Neither correlation is strong.
If we zoom the chart in on the interquartile range, the distribution of those characteristics across the quality categories does not show big differences: the linear regression lines for each category are very close and nearly parallel.
Points of quality 5 and 6 appear throughout the chart, since those are the most frequent wines.
These two charts show the relation between quality, chlorides and total sulfur dioxide. The only difference is that the first shows the whole range of the dataset, including outliers, and the second shows only the interquartile range (25%-75%). When analysing data, it is important to take into account the impact of the outliers on the result. The first chart could suggest that there are really big differences in wine quality related to those two characteristics; the second chart shows that those differences are not so big.
The analysis of the 4898 white wines shows that there are few bad wines (183 rated 3 or 4). Nearly half of the wines are rated 6 and ~1500 are rated 5; both are scores for good wines. Fewer than 1000 wines are rated 7, and only 5 are rated 9.
It is difficult to find a pattern or a characteristic that makes the difference between the wines, which suggests that other characteristics make the difference but are not included in the dataset. The color, the flavor and the smell of a wine depend not only on the quantity of alcohol and sugar but also on the maturation process of the wine, the harvest year of the grapes, and other factors linked to the tasting date and the expert's circumstances.
The fact that at least 3 experts tasted each wine helps to reduce possible bias.
The conclusion after analyzing the data is that if we go to a shop and buy a white wine, we have more than an 80% probability of buying a good wine, unless we are unlucky.
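How large this probability is depends on where "good" starts, which the report does not define exactly. As a sketch, the share of wines at or above a chosen score can be computed from the quality counts; the function name `share_at_least` and the thresholds tried are assumptions made for illustration.

```r
# Counts per quality score, copied from the "Wine per quality" table.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Share of wines rated at or above a threshold score.
share_at_least <- function(counts, threshold) {
  scores <- as.integer(names(counts))
  sum(counts[scores >= threshold]) / sum(counts)
}

# Two possible definitions of "good": 5 and above, or 6 and above.
share_5_plus <- share_at_least(quality_counts, 5)
share_6_plus <- share_at_least(quality_counts, 6)

print(round(share_5_plus, 3))
print(round(share_6_plus, 3))
```

These shares describe the wines in this dataset; treating them as the odds for a bottle picked in a shop assumes the dataset is representative of what shops sell.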
One possible future analysis could take a sample of wines with different quality scores but similar values for the chemical characteristics, and analyze in detail the variations between them that put a wine in one score and not in another.
Another possible analysis could take the 3 original scores of each wine and generate a data set 3 times the size of this one while keeping the information related to each wine; that is, we could see the 3 scores of each wine and treat each score as a different wine. That could help to explain, for example, why a wine is scored 5 by one expert and 7 by another.
Personal reflection: a good white wine and a great white wine could have the same chemical properties, because what makes a wine great is the spirit inside. Personally I prefer sweet wines with a low level of alcohol, so I would probably like a wine rated 6 more than a wine rated 7 or 8.